Conversation

@alissawu commented Oct 13, 2025

Closes #38
📌 What’s Changed

  • Added scraper dashboard for monitoring job and error status
  • Wrote queries to fetch job and error data, joining errors to jobs to cross-reference errors with job types
  • Added HTML escaping to prevent XSS vulnerability
  • Added a seed.ts file with simulated data; the database can be seeded with bun run seed
  • Implemented visual stats bars with job/error breakdowns
  • Added client-side filtering by job type, status, and error type
  • Added URL search for both jobs and errors
  • Added pagination (12 items per page)
  • Added expandable error details for failed jobs
  • Added auto-refresh (30 seconds) with manual refresh button
[Screen recording attached: Screen.Recording.2025-10-13.at.12.32.51.AM-2.mp4]

✅ Actions

  • Possibly make "failed jobs" more prominent by highlighting it, or adding a red dot or similar indicator to show that it is expandable
  • Add authentication (basic username/password or Cloudflare Access); the only reason it isn't done is that I'm not sure which one the team needs
  • Make CSS pixel values less hardcoded?
  • Make the auto-refresh interval a user input? (Currently 30 seconds; I think the fixed value is fine.)

📝 Notes for Reviewer

  1. To run:
    navigate to the scraper directory
    bun run db:generate
    bun run db:migrate:local
    bun run seed (this seeds your local database with the generated data)
    bun run dev
    open the wrangler:info URL shown in the terminal

  2. Questions / issues

  • The Biome check does not pass because of dangerouslySetInnerHTML, but this is intentional: we need to inject client-side JS for the dashboard refresh and pagination. The injected script is static and server-controlled, and we don't embed any dynamic user input, so it is safe from XSS. I also added HTML escaping for extra safety (see the escaping sketch after this list), plus a warning suppressor at the end of components.tsx.
    Notes on assumptions:
  • I used innerJoin with the jobs table when querying errors (see the query sketch after this list). This assumes every error has a valid job, i.e. if an error exists, it corresponds to a job in the jobs table.
    • The catch is that an error without a corresponding job won't be queried. I believe an error unrelated to a job would have been resolved already, but I'm not sure about this assumption.
    • There should also be error-deletion handling. When an error is resolved, it should probably be deleted; once a job completes, it shouldn't retain its errors (imo). This isn't implemented yet, so I'm not sure how to guard the edge cases.
  • Currently I've capped the query sizes: 500 for jobs and 100 for errors. These limits are easy to change.
  • I decided to do filtering etc. client-side because 1) it's faster (no network round trip) and 2) it lowers D1 query costs, since I'm not sure how many queries per day to expect. I'm assuming there will be under roughly 20k jobs/errors to query, so client-side filtering stays extremely fast (see the filtering sketch after this list).
  • Right now auto-refresh (30 s) is an option, but refresh defaults to a manual button. This saves query costs, and auto-refresh can always be turned on. I can make auto the default if wanted.
  • CSS heights etc. are hardcoded; I may make them relative later. Since we're all on computers I don't expect much variation, but TBD.
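
For reference, the HTML escaping mentioned above could look roughly like this minimal sketch (the helper's actual name and coverage in the PR may differ):

```ts
// Minimal sketch of an HTML-escaping helper; the name and the exact set of
// replacements are assumptions, not necessarily the PR's implementation.
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, "&amp;") // must run first so later entities aren't double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```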
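And a sketch of the errors query with the inner join and the cap, assuming Drizzle ORM on D1 (the table and column names errors, jobs, jobId, and createdAt are assumptions about the schema):

```ts
import { drizzle } from "drizzle-orm/d1";
import { desc, eq } from "drizzle-orm";
import { errors, jobs } from "./schema"; // hypothetical schema module

// env.DB is the Worker's D1 binding, available inside the fetch handler.
const db = drizzle(env.DB);

// innerJoin drops any error row that lacks a matching job.
const recentErrors = await db
  .select()
  .from(errors)
  .innerJoin(jobs, eq(errors.jobId, jobs.id))
  .orderBy(desc(errors.createdAt))
  .limit(100); // the errors cap mentioned above
```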
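Finally, the client-side filter plus pagination amounts to something like this sketch (the type and field names are hypothetical):

```ts
const ITEMS_PER_PAGE = 12;

type Job = { type: string; status: string }; // hypothetical shape

// Filter in memory, then slice out the current 12-item page.
function visibleJobs(all: Job[], typeFilter: string, page: number): Job[] {
  const filtered =
    typeFilter === "all" ? all : all.filter((j) => j.type === typeFilter);
  return filtered.slice((page - 1) * ITEMS_PER_PAGE, page * ITEMS_PER_PAGE);
}
```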

@chenxin-yan changed the title from "Scraper status dashboard" to "feat(scraper): add scraping status dashboard" on Oct 13, 2025
@chenxin-yan left a comment

Great job! Looking good, with a few things to consider.

const REFRESH_INTERVAL = 30000; // 30 sec
const ITEMS_PER_PAGE = 12;
// goes inside <script> tags
const scriptContent = `

It is not ideal or elegant to have a large chunk of JS code in a string like this. You could move it to a separate JS file and run it as a service worker in the user's browser. I haven't looked into how to do this with Cloudflare Workers and Hono, but you can look into it.
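
One possible shape, as a sketch only (the route path and the dashboard-client module are assumptions, and this serves the script as a plain asset rather than a service worker):

```ts
import { Hono } from "hono";
import { dashboardScript } from "./dashboard-client"; // hypothetical module exporting the JS source as a string

const app = new Hono();

// Serve the client script from its own route so it can live in a separate
// file instead of a template string inside components.tsx.
app.get("/static/dashboard.js", (c) =>
  c.body(dashboardScript, 200, { "Content-Type": "application/javascript" }),
);
```

The dashboard HTML would then load it with `<script src="/static/dashboard.js"></script>` instead of inlining it via dangerouslySetInnerHTML.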

Also, we are currently long-polling data from the database, which is also not ideal. You should look into Cloudflare Durable Objects so we can use a WebSocket for real-time updates instead of long polling. Let me know if you need any help with it.

@chenxin-yan commented

To answer some of your questions:

> Possibly make "failed jobs" more prominent by highlighting it, or adding a red dot or similar indicator to show that it is expandable

It's totally up to you. Looks good so far.

> Add authentication (basic username/password or Cloudflare Access); the only reason it isn't done is that I'm not sure which one the team needs

You can just ignore auth for now: there is no sensitive info displayed in the dashboard and it is purely presentational, so it should be fine.

> Make CSS pixel values less hardcoded?

It's up to you.

> Make the auto-refresh interval a user input? (Currently 30 seconds; I think the fixed value is fine.)

As I suggested in the code review, check out Durable Objects: we can use a WebSocket connection to provide real-time updates instead of long polling.
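
A rough sketch of what that could look like (the class name, wiring, and message format are all assumptions; see the Durable Objects docs for the real setup):

```ts
// Hypothetical Durable Object that fans dashboard updates out over WebSockets
// instead of the client polling the database.
export class DashboardHub {
  private sessions: WebSocket[] = [];

  async fetch(request: Request): Promise<Response> {
    // The browser connects with an Upgrade: websocket request.
    if (request.headers.get("Upgrade") === "websocket") {
      const pair = new WebSocketPair();
      const [client, server] = Object.values(pair);
      server.accept();
      this.sessions.push(server);
      return new Response(null, { status: 101, webSocket: client });
    }
    // The scraper POSTs a change notification; broadcast it to open sockets.
    const update = await request.text();
    this.sessions = this.sessions.filter((ws) => {
      try {
        ws.send(update);
        return true;
      } catch {
        return false; // drop sockets that have closed
      }
    });
    return new Response("ok");
  }
}
```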

> The catch is that an error without a corresponding job won't be queried. I believe an error unrelated to a job would have been resolved already, but I'm not sure about this assumption.

You can safely make this assumption: we will not delete any job, and when an error record is created the job must exist.

> There should also be error-deletion handling. When an error is resolved, it should probably be deleted; once a job completes, it shouldn't retain its errors (imo). This isn't implemented yet, so I'm not sure how to guard the edge cases.

Imo we shouldn't delete any error record, precisely for monitoring/debugging purposes, e.g. checking how many retries a given job took to succeed, the error details, and so on.

> I decided to do filtering etc. client-side because 1) it's faster (no network round trip) and 2) it lowers D1 query costs, since I'm not sure how many queries per day to expect. I'm assuming there will be under roughly 20k jobs/errors to query, so client-side filtering stays extremely fast.

Yes, client-side filtering is good for this case.

> CSS heights etc. are hardcoded; I may make them relative later. Since we're all on computers I don't expect much variation, but TBD.

It's up to you. Feel free to hold off on this for now.

@chenxin-yan force-pushed the main branch 4 times, most recently from 7072356 to 503ca3f on October 19, 2025